Create RFC for health endpoint #141
Conversation
Signed-off-by: Kump3r <[email protected]>
```json
"status": "healthy/unhealthy",
"details": {
    "database": "healthy/unhealthy",
    "workers": "healthy/unhealthy",
```
"workers" as a single entry does not convey much meaningful information IMO. A list with the status of each worker might be more useful. Also, the semantics of the general "status" should be clarified: what is considered a healthy instance?
Yeah, I hadn't thought about it that way, thanks for the input. Extending it a bit further with our 1-on-1 discussion, we might not even need information about the workers or the database, but rather whether the API is working and whether workloads are schedulable. So instead of looking at specific interfaces or services, the questions are: is the ATC working, and if so, can it schedule workloads? An example that comes to mind is a systematic/periodic one-off build which is tracked by this backend and reports a simple "run-jobs: healthy". Should the API be unreachable, the endpoint will be down anyway. In that case it would look like:

```json
"status": "healthy",
"run-jobs": "healthy"
```

Should it fail to run jobs within a certain time frame, the status will change to unhealthy.
Does that more or less sum it up, or am I missing something?
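To make the idea above concrete, here is a minimal sketch of how such a response could be built. All names here (`HealthResponse`, `buildHealthResponse`, the canary-check hook) are hypothetical illustrations, not part of the RFC or Concourse's actual code:

```go
package main

import (
	"encoding/json"
	"net/http"
)

// HealthResponse is a hypothetical shape for the minimal response
// discussed above; the field names are illustrative, not final.
type HealthResponse struct {
	Status  string `json:"status"`
	RunJobs string `json:"run-jobs"`
}

// buildHealthResponse derives the response from whether the periodic
// canary ("one-off build") succeeded within the allowed window.
func buildHealthResponse(canaryOK bool) HealthResponse {
	if canaryOK {
		return HealthResponse{Status: "healthy", RunJobs: "healthy"}
	}
	return HealthResponse{Status: "unhealthy", RunJobs: "unhealthy"}
}

// healthHandler serves the JSON and returns 503 when unhealthy, so
// load balancers and probes can react to the status code alone.
func healthHandler(canaryOK func() bool) http.HandlerFunc {
	return func(w http.ResponseWriter, r *http.Request) {
		resp := buildHealthResponse(canaryOK())
		w.Header().Set("Content-Type", "application/json")
		if resp.Status != "healthy" {
			w.WriteHeader(http.StatusServiceUnavailable)
		}
		json.NewEncoder(w).Encode(resp)
	}
}

func main() {
	// In a real server this would be wired up, e.g.:
	// http.Handle("/api/v1/health", healthHandler(canaryCheck))
	_ = healthHandler
}
```

The actual canary check (tracking the periodic build) is deliberately left as an injected function, since how it is tracked is exactly what the RFC still needs to decide.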
I had an offline discussion with a stakeholder who also raised a valid question that can be added to the document:
Thanks for the question; I will also add this to the document once a few more questions are gathered!
Also part of an offline follow-up with another user:
I think a GUI change is a bit out of scope for this RFC, although this RFC would make it easy to extend in the UI, so it is worth writing it down as a possible future follow-up.
Totally for this. A lot of my questions come around implementation, which I see already written down in the POC PR concourse/concourse#4818. I think it would be nice for this RFC to define specifically what we want the health JSON response to look like. A Concourse web node is made up of a bunch of micro-service-ish components. We could potentially display the health of all of these components (see components.go). There may be some exceptions in that file, but most of these components are run "globally" across one of the web nodes based on the workload the web node is handling. They're load-balanced! There are some services on the web node that are not load-balanced, like the TSA and API. Those are always running on all web nodes. A detailed health response could look something like this, which I think would accurately describe the entire Concourse cluster:

```json
{
  "status": "...",
  "workers": {
    "worker-1": {
      "baggageclaim": "...",
      "garden": "..."
    }
    ...
  },
  "web-nodes": {
    "web-1": {
      "api": "...",
      "tsa": "...",
      "db-connection": "...",
      ...
    }
    ...
  },
  "global-components": {
    "log-collector": "...",
    "lidar": "...",
    "secret-management": "...",
    "scheduler": "..."
    ...
  }
}
```

I wouldn't expect an initial PR to fully implement all of that, though. I think this RFC could clearly define what we want the end goal to look like and then slowly work towards it through multiple PRs. WDYT?
I agree, I really like the idea and I am all for having an easy-to-reach status board for all of the components. One of the key questions that comes to mind is when the overall status should change to not healthy: although each component has its share of work to be done, if some of them flap or are unstable, that shouldn't mean the instance is not operational, but rather that it is somewhat degraded. So, more or less building on what you wrote, it would be great before closing the RFC to have the JSON response and the conditions that are hard requirements for a healthy instance figured out. Thanks to all for the feedback, I like the overall direction of the discussions here. Once we have a couple more comments, I will add all the discussions to the document.
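One way to frame the healthy/degraded/unhealthy question is as an explicit aggregation rule. The sketch below is only an illustration of that idea: the component names, the set of "critical" components, and the three-state result are all assumptions, not decisions from this RFC:

```go
package main

import "fmt"

// critical lists components whose failure makes the whole instance
// "unhealthy". Which components belong here is exactly the open
// question in the thread; this set is a placeholder.
var critical = map[string]bool{
	"api":           true,
	"db-connection": true,
}

// OverallStatus aggregates per-component statuses ("healthy" or
// "unhealthy") into "healthy", "degraded", or "unhealthy":
// a failing critical component means unhealthy; a failing
// non-critical (e.g. flapping) component only means degraded.
func OverallStatus(components map[string]string) string {
	anyUnhealthy := false
	for name, status := range components {
		if status == "healthy" {
			continue
		}
		if critical[name] {
			return "unhealthy"
		}
		anyUnhealthy = true
	}
	if anyUnhealthy {
		return "degraded"
	}
	return "healthy"
}

func main() {
	// A flapping lidar alone degrades but does not fail the instance.
	fmt.Println(OverallStatus(map[string]string{
		"api":   "healthy",
		"lidar": "unhealthy",
	}))
}
```

Writing the rule down this way would let the RFC state the hard requirements for "healthy" as a simple list, rather than leaving them implicit.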
If you plan to reuse the same endpoint for Kubernetes health checks, you can introduce a parameter to differentiate between web and worker nodes. For example:
It could also be extended to the pod level, such as:
This way, Kubernetes can identify and restart individual pods if they become unhealthy.
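The parameter idea could be as simple as a query string on the same endpoint. The sketch below is hypothetical: the parameter names `scope` and `name`, and the `/api/v1/health?scope=worker&name=worker-1` shape, are illustrations of the suggestion above, not anything the RFC has settled on:

```go
package main

import "net/url"

// ScopeFromQuery extracts a hypothetical scope ("web", "worker", or
// the default "cluster") and an optional pod/node name from the raw
// query string of a health request, so a Kubernetes probe could check
// just its own pod, e.g. /api/v1/health?scope=worker&name=worker-1.
func ScopeFromQuery(rawQuery string) (scope, name string) {
	q, err := url.ParseQuery(rawQuery)
	if err != nil {
		// Malformed queries fall back to the whole-cluster view.
		return "cluster", ""
	}
	scope = q.Get("scope")
	if scope == "" {
		scope = "cluster"
	}
	return scope, q.Get("name")
}

func main() {}
```

A per-pod scope like this would pair naturally with a Kubernetes liveness probe pointing at the scoped URL, while the unscoped endpoint keeps serving the cluster-wide view.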
Signed-off-by: Kump3r <[email protected]>
Rendered
Previously discussed as well, but I think it makes sense to collect comments on this again, given industry standards and an overall need for this. Should we open a discussion to collect interest/opinions from the community as well?